Circular Sequence Comparison with q-grams

Authors

  • Roberto Grossi
  • Costas S. Iliopoulos
  • Robert Mercas
  • Nadia Pisanti
  • Solon P. Pissis
  • Ahmad Retha
  • Fatima Vayani
Abstract

Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular genome structure is a common phenomenon in nature, but specialized alignment techniques for circular sequence comparison are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy that is very competitive with the state of the art.
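
As a rough illustration of the idea of comparing circular sequences through q-grams, the sketch below computes the classical q-gram distance (the sum of absolute differences between q-gram counts) and tries every rotation of one sequence by brute force. This is not the measure or the algorithm introduced in the paper, which the abstract does not spell out. The function names `qgram_profile`, `qgram_distance` and `best_rotation_bruteforce` are hypothetical, and the exhaustive search is only a simple baseline.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every length-q substring of the linear string s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """Classical q-gram distance: sum of absolute count differences."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def best_rotation_bruteforce(x, y, q):
    """Try every rotation of y and keep the one closest to x.

    Assumption: brute force over all len(y) rotations; this is an
    illustrative baseline, not the efficient method of the paper.
    """
    best = min(range(len(y)),
               key=lambda r: qgram_distance(x, y[r:] + y[:r], q))
    return best, qgram_distance(x, y[best:] + y[:best], q)


if __name__ == "__main__":
    # "TACACG" is "ACGTAC" rotated, so some rotation gives distance 0.
    print(best_rotation_bruteforce("ACGTAC", "TACACG", 2))
```

Each distance evaluation here rebuilds both profiles from scratch, so the search costs roughly quadratic time in the sequence length; it is meant for intuition rather than for real genome-scale data.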

Similar resources

Exact Circular Pattern Matchings Using Bit-Parallelism and q-Gram Technique

We present three efficient algorithms for exact circular string matching. One of the algorithms is for a single circular pattern and the others are for multiple circular patterns. Our algorithms apply q-grams and bit parallelism. The algorithms are named CBNDMq, CMultiBNDM and CMultiBNDMq, respectively. These two problems can also be solved by some proposed multiple patterns matching algori...

A faster and more accurate heuristic for cyclic edit distance computation

Sequence comparison is the core computation of many applications involving textual representations of data. Edit distance is the most widely used measure to quantify the similarity of two sequences. Edit distance can be defined as the minimal total cost of a sequence of edit operations to transform one sequence into the other; for a sequence x of length m and a sequence y of length n, it can b...
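
The definition of edit distance quoted above can be made concrete with the textbook dynamic-programming recurrence. The sketch below assumes unit costs for insertion, deletion and substitution, which the snippet does not specify, and computes the ordinary (linear, non-cyclic) edit distance only; the function name `edit_distance` is hypothetical.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between linear strings x and y.

    Assumption: insertions, deletions and substitutions each cost 1.
    Only two rows of the DP table are kept in memory.
    """
    m, n = len(x), len(y)
    prev = list(range(n + 1))  # distance from the empty prefix of x
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete x[i-1]
                          curr[j - 1] + 1,     # insert y[j-1]
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```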

Q-gram Analysis and Urn Models

Words of fixed size q are commonly referred to as q-grams. We consider the problem of q-gram filtration, a method commonly used to speed up sequence comparison. We are interested in the statistics of the number of q-grams common to two random texts (where multiplicities are not counted) in the non-uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a q-gram ...
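
The quantity studied in this snippet, the number of q-grams common to two texts with multiplicities ignored, can be computed directly with sets. The following is a minimal sketch; the helper names `distinct_qgrams` and `common_qgram_count` are hypothetical and do not come from the cited work.

```python
def distinct_qgrams(text, q):
    """Set of distinct length-q substrings of text (multiplicities dropped)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def common_qgram_count(a, b, q):
    """Number of distinct q-grams that occur in both a and b."""
    return len(distinct_qgrams(a, q) & distinct_qgrams(b, q))


if __name__ == "__main__":
    # Shared 2-grams of the two texts: AC, CG, TA.
    print(common_qgram_count("ACGTAC", "TACACG", 2))
```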

Indexing DNA Sequences Using q-Grams

In recent years we have observed a growing interest in similarity search on large collections of biological sequences. Contributing to this interest, this paper presents a method for indexing DNA sequences efficiently based on q-grams, to facilitate similarity search in a DNA database and sidestep the need for a linear scan of the entire database. Two level index – hash table and c-trees – are ...

Similarity Joins of Text with Incomplete Information Formats

Similarity join over text is important in text retrieval and query. Due to the incomplete formats of information representation, such as abbreviation and short word, similarity joins should address an asymmetric feature that these incomplete formats may contain only partial information of their original representation. Current approaches, including cosine similarity with q-grams, can hardly dea...

Publication date: 2015